Abstract

We present an algorithm, called the Offset Tree, for learning to make decisions in situations where the payoff
of only one choice is observed, rather than all choices. The algorithm reduces this setting to binary
classification, allowing one to reuse any existing, fully supervised binary classification algorithm in this partial
information setting. We show that the Offset Tree is an optimal reduction to binary classification. In particular,
it has regret at most (k - 1) times the regret of the binary classifier it uses (where k is the number of choices),
and no reduction to binary classification can do better. This reduction is also computationally optimal, both at
training and test time, requiring just $O(\log_2 k)$ work to train on an example or make a prediction.

Experiments with the Offset Tree show that it generally performs better than several alternative ap- 
proaches. 

Keywords: Supervised Learning, Active Learning, Bandits, Reinforcement Learning, Interactive Learning.



1. Introduction 

This paper is about learning to make decisions in partial feedback settings where the payoff of only one 
choice is observed rather than all choices. 

As an example, consider an internet site recommending ads or other content based on such observable 
quantities as user history and search engine queries, which are unique or nearly unique for every decision. 
After the ad is displayed, a user either clicks on it or not. This type of feedback differs critically from the 
standard supervised learning setting since we don't observe whether or not the user would have clicked had a 
different ad been displayed instead.

In an online version of the problem, a policy chooses which ads to display and uses the observed feedback 
to improve its future ad choices. A good solution to this problem must explore different choices and properly 
exploit the feedback. 

The problem faced by an internet site, however, is more complex. Such a site has observed many interactions
historically, and would like to exploit them in forming an initial policy, which may then be improved by
further online exploration. Since exploration decisions have already been made, online solutions are not
applicable. To properly use the data, we need non-interactive methods for learning with partial feedback.

This paper is about constructing a family of algorithms for non-interactive learning in such partial feed- 
back settings. Since any non-interactive solution can be composed with an exploration policy to form an 
algorithm for the online learning setting, the algorithm proposed here can also be used online. Indeed, some 
of our experiments are done in an online setting. 

Problem Definition 

Here is a formal description of non-interactive data generation: 

1. Some unknown distribution D generates a feature vector x and a vector $r = (r_1, r_2, \ldots, r_k)$, where
$r_i \in [0, 1]$ is the reward of the i-th action, $i \in \{1, \ldots, k\}$. Only x is revealed to the learner.

2. An existing policy chooses an action $a \in \{1, \ldots, k\}$.

3. The reward $r_a$ is revealed.

The goal is to learn a policy $\pi : X \to \{1, \ldots, k\}$ for choosing action a given x, so as to maximize
the expected reward with respect to D, given by

$\eta(\pi, D) = \mathbf{E}_{(x, r) \sim D}\left[ r_{\pi(x)} \right]$.

We call this a partial label problem (defined by) D. 
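
The data generation protocol above can be simulated in a few lines. This is a minimal sketch under stated assumptions: the toy distribution `sample_context_and_rewards`, the uniform logging policy, the 0-based action indices, and all function names are illustrative and do not come from the paper.

```python
import random

def sample_context_and_rewards(rng, k):
    """Toy stand-in for D (an assumption): context x is an integer, action x pays 1."""
    x = rng.randrange(k)
    r = [1.0 if i == x else 0.0 for i in range(k)]
    return x, r

def log_partial_label_data(n, k, seed=0):
    """Steps 1-3: draw (x, r), let a logging policy pick a, reveal only r_a."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x, r = sample_context_and_rewards(rng, k)
        a = rng.randrange(k)                 # existing policy, assumed uniform here
        data.append((x, a, r[a], 1.0 / k))   # (x, a, r_a, p(a))
    return data

def policy_value(policy, n, k, seed=1):
    """Monte Carlo estimate of eta(pi, D); uses full reward vectors, so simulation only."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x, r = sample_context_and_rewards(rng, k)
        total += r[policy(x)]
    return total / n
```

Note that `policy_value` needs the full reward vector, which a learner never sees; the logged tuples from `log_partial_label_data` are all that a non-interactive method gets to work with.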

Existing Approaches 

Probably the simplest approach is to regress on the reward $r_a$ given x and a, and then choose according to
the largest predicted reward given a new x. This approach reduces the partial label problem to a standard
regression problem.

A key technique for analyzing such a reduction is regret analysis, which bounds the "regret" of the
resulting policy in terms of the regressor's "regret" on the problem of predicting $r_a$ given x and a. Here regret
is the difference between the largest reward that can be achieved on the problem and the reward achieved by
the predictor; or, defined in terms of losses, the difference between the incurred loss and the smallest
achievable loss. One analyzes excess loss (i.e., regret) instead of absolute loss so that the bounds apply to
inherently noisy problems. It turns out that the simple approach above has regret that scales with the square
root of the regressor's regret (see section 6 for a proof). Since the latter is upper bounded by 1, its square
root is at least as large as the regret itself, which is undesirable.



Another natural approach is to use the technique in (25). Given a distribution p(a) over the actions given
x, the idea is to transform each partial label example $(x, a, r_a, p(a))$ into an importance weighted multiclass
example $(x, a, r_a / p(a))$, where $r_a / p(a)$ is the cost of not predicting label a on input x. These examples
are then fed into any importance weighted multiclass classification algorithm, with the output classifier used
to make future predictions. Section 6 shows that when p(a) is uniform, the resulting regret on the original
partial label problem is bounded by k times the importance weighted multiclass regret, where k is the number
of choices. The importance weighted multiclass classification problem can, in turn, be reduced to binary
classification, but all known conversions yield worse bounds than the approach presented in this paper.



Results 

We propose the Offset Tree algorithm for reducing the partial label problem to binary classification, allowing 
one to reuse any existing, fully supervised binary classification algorithm for the partial label problem. 

The Offset Tree uses the following trick, which is easiest to understand in the case of k = 2 choices
(covered in section 3). When the observed reward $r_a$ of choice a is low, we essentially pretend that the other
choice a' was chosen and a different reward $r_{a'}$ was observed. Precisely how this is done, and why, is driven
by the regret analysis. This basic trick is composable in a binary tree structure for k > 2, as described in
section 4.

The Offset Tree achieves computational efficiency in two ways: First, it improves the dependence on
k from $O(k)$ to $O(\log_2 k)$. It is also an oracle algorithm, which implies that it can use the implicit
optimization in existing learning algorithms rather than a brute-force enumeration over policies, as in the Exp4
algorithm (3). We prove that the Offset Tree policy regret is bounded by k - 1 times the regret of the binary
classifier in solving the induced binary problems. Section 5 shows that no reduction can provide a better
guarantee, giving the first nontrivial lower bound for learning reductions. Since the bound is tight and has a
dependence on k, it shows that the partial label problem is inherently different from standard fully supervised
learning problems like k-class classification.

Section 6 analyzes several alternative approaches. An empirical comparison of these approaches is given
in section 7.

Related Work

The problem considered here is a non-interactive version of the contextual bandit problem (see (0; B S [3 
|23) for background on the bandit problem). The interactive version has been analyzed under various addi- 
tional assumptions (7; 13; 17; 18; 20; 22), including payoffs as a linear function of the side information ( 1; 2). 
The Exp4 algorithm (3) has a nice assumption-free analysis. However, it is intractable when the number of 
policies we want to compete with is large. It also relies on careful control of the action choosing distribution, 
and thus cannot be applied to historical data, i.e., non-interactively. 

Sample complexity results for policy evaluation in reinforcement learning and contextual bandits ((itI) 
show that Empirical Risk Minimization type algorithms can find a good policy in a non-interactive setting. 
The results here are mostly orthogonal to these results, although we do show in section |A] that a constant 
factor improvement in sample complexity is possible using the offset trick. 

The Banditron algorithm (7) deals with a similar setting but does not address several concerns that the
Offset Tree addresses: (1) the Banditron requires an interactive setting; (2) it deals with a specialization
of our setting where the reward for one choice is 1, and 0 for all other choices; (3) its analysis is further
specialized to the case where linear separators with a small hinge loss exist; (4) it requires exponentially
more computation in k; (5) the Banditron is not an oracle algorithm, so it is unclear, for example, how to compose
it with a decision tree bias.

Transformations from partial label problems to fully supervised problems can be thought of as learning
methods for dealing with sample selection bias (11), which is heavily studied in Economics and Statistics.



2. Basic Definitions 

This section reviews several basic learning problems and the Costing method (26) used in the construction.

A k-class classification problem is defined by a distribution Q over $X \times Y$, where X is an arbitrary
feature space and Y is a label space with $|Y| = k$. The goal is to learn a classifier $c : X \to Y$ minimizing
the error rate on Q,

$e(c, Q) = \Pr_{(x,y) \sim Q}[c(x) \neq y] = \mathbf{E}_{(x,y) \sim Q}[\mathbf{1}(c(x) \neq y)]$,

given training examples of the form $(x, y) \in X \times Y$. Here $\mathbf{1}(\cdot)$ is the indicator function, which evaluates to 1
when its argument is true, and to 0 otherwise.

Importance weighted classification is a generalization where some errors are more costly than others.
Formally, an importance weighted classification problem is defined by a distribution P over $X \times Y \times [0, \infty)$.
Given training examples of the form $(x, y, w) \in X \times Y \times [0, \infty)$, where w is the cost associated with
mislabeling x, the goal is to learn a classifier $c : X \to Y$ minimizing the importance weighted loss on P,
$\mathbf{E}_{(x,y,w) \sim P}[w \cdot \mathbf{1}(c(x) \neq y)]$.

A folk theorem (26) says that for any importance weighted distribution P, there exists a constant
$\bar{w} = \mathbf{E}_{(x,y,w) \sim P}[w]$ such that for any classifier $c : X \to Y$,

$\mathbf{E}_{(x,y) \sim Q}[\mathbf{1}(c(x) \neq y)] = \frac{1}{\bar{w}} \mathbf{E}_{(x,y,w) \sim P}[w \cdot \mathbf{1}(c(x) \neq y)]$,

where Q is the distribution over $X \times Y$ defined by

$Q(x, y, w) = \frac{w}{\bar{w}} P(x, y, w)$,

marginalized over w. In other words, choosing c to minimize the error rate under Q is equivalent to choosing
c to minimize the importance weighted loss under P.

The Costing method (26) can be used to resample the training set drawn from P using rejection sampling
on the importance weights (an example with weight w is accepted with probability proportional to w), so that
the resampled set is effectively drawn from Q. Then, any binary classification algorithm can be run on the
resampled set to optimize the importance weighted loss on P.

Costing runs a base classification algorithm on multiple draws of the resampled set, and averages over
the learned classifiers when making importance weighted predictions (see (26) for details). To simplify the
analysis, we do not actually have to consider separate classifiers. We can simply augment the feature space
with the index of the resampled set and then learn a single classifier on the union of all resampled data. The
implication of this observation is that we can view Costing as a machine that maps importance weighted
examples to unweighted examples. We use this method in Algorithms 1 and 2 below.
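
The view of Costing as a map from weighted to unweighted examples can be sketched as follows. This is an illustrative approximation, not the reference implementation from (26): the function names are hypothetical, and the acceptance probability w / w_max (with w_max the largest weight in the sample) is one assumed way to make acceptance proportional to w.

```python
import random

def costing_resample(weighted_examples, rng=None):
    """Rejection-sample (x, y, w) triples into unweighted (x, y) pairs.

    Each example is kept with probability w / w_max, i.e. proportional to w.
    """
    rng = rng or random.Random(0)
    w_max = max((w for _, _, w in weighted_examples), default=0.0)
    if w_max <= 0.0:
        return []
    return [(x, y) for x, y, w in weighted_examples if rng.random() < w / w_max]

def costing_sets(weighted_examples, rounds, seed=0):
    """Union of several resampled sets; each x is tagged with its set index."""
    rng = random.Random(seed)
    out = []
    for t in range(rounds):
        out.extend(((x, t), y) for x, y in costing_resample(weighted_examples, rng))
    return out
```

`costing_sets` mirrors the single-classifier trick described above: the set index becomes an extra feature, and one classifier can be trained on the union.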

3. The Binary Case 

This section deals with the special case of k = 2 actions. We state the algorithm, prove the regret bound
(which is later used for the general k case), and state a sample complexity bound. For simplicity, we let the 
two action choices in this section be 1 and —1. 

3.1 The Algorithm 

The Binary Offset algorithm is a reduction from the 2-class partial label problem to binary classification. 
The reduction operates per example, implying that it can be used either online or offline. We state it here for 
the offline case. The algorithm reduces the original problem to binary importance weighted classification, 
which is then reduced to binary classification using the Costing method described above. A base binary 
classification algorithm Learner is used as a subroutine. 

The key trick appears inside the loop in Algorithm 1, where importance weighted binary examples are
formed. The offset of 1/2 changes the range of importances, effectively reducing the variance of the induced
problem. This trick is driven by the regret analysis in section 3.2.



Algorithm 1: Binary Offset (binary classification algorithm Learner, 2-class partial label dataset S)
set $S' = \emptyset$
for each $(x, a, r_a, p(a)) \in S$ do
    Form an importance weighted example
    $(x, y, w) = \left( x, \; \mathrm{sign}(a (r_a - 1/2)), \; \frac{|r_a - 1/2|}{p(a)} \right)$.
    Add (x, y, w) to S'.
return Learner(Costing(S')).
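
The loop body of Algorithm 1 is small enough to write out directly. The sketch below covers only the example transformation; the function name and the `offset` parameter are illustrative, and dropping examples with $r_a$ exactly equal to the offset (weight 0, sign undefined) is an assumption made here.

```python
def binary_offset_examples(dataset, offset=0.5):
    """Map partial label tuples (x, a, r_a, p_a), a in {+1, -1}, to (x, y, w).

    y = sign(a * (r_a - offset)) and w = |r_a - offset| / p_a.
    """
    out = []
    for x, a, r_a, p_a in dataset:
        margin = a * (r_a - offset)
        if margin == 0:
            continue  # assumption: zero-weight examples carry no signal, so skip them
        y = 1 if margin > 0 else -1
        out.append((x, y, abs(r_a - offset) / p_a))
    return out
```

A high reward for the chosen action votes for that action; a low reward votes for the other one, with a weight that grows as the reward moves away from 1/2.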



3.2 Regret Analysis 

This section proves a regret transform theorem for the Binary Offset reduction. Informally, regret measures 
how well a predictor performs compared to the best possible predictor on the same problem. A regret trans- 
form shows how the regret of a base classifier on the induced (binary classification) problem controls the 
regret of the resulting policy on the original (partial label) problem. Thus a regret transform bounds only 
excess loss due to suboptimal prediction. 

Binary Offset transforms partial label examples into binary examples. This process implicitly transforms
the distribution D defining the partial label problem into a distribution $Q_D$ over binary examples, via a
distribution over importance weighted binary examples. Note that even though the latter distribution depends
on both D and the action-choosing distribution p, the induced binary distribution $Q_D$ depends only on D.
Indeed, the probability of label 1 given x and r, according to $Q_D$, is proportional to

$\mathbf{E}_{a \sim p}\left[ \frac{|r_a - 1/2|}{p(a)} \, \mathbf{1}(a(r_a - 1/2) > 0) \right] = (r_1 - 1/2) \, \mathbf{1}(r_1 > 1/2) + (1/2 - r_{-1}) \, \mathbf{1}(r_{-1} < 1/2)$,

independent of p.
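
The cancellation of p(a) can be checked with exact rational arithmetic. This is an illustrative check only; `expected_label_weights` is a hypothetical helper, and its arguments are assumed to be `Fraction` values with the logging probability strictly between 0 and 1.

```python
from fractions import Fraction as F

def expected_label_weights(r_pos, r_neg, p_pos):
    """Expected importance weight placed on labels +1 and -1 for fixed rewards.

    r_pos, r_neg: rewards of actions +1 and -1; p_pos: probability of logging +1.
    """
    half = F(1, 2)
    weights = {1: F(0), -1: F(0)}
    for a, r_a, p_a in ((1, r_pos, p_pos), (-1, r_neg, 1 - p_pos)):
        margin = a * (r_a - half)
        if margin != 0:
            y = 1 if margin > 0 else -1
            weights[y] += p_a * abs(r_a - half) / p_a  # p_a cancels exactly
    return weights
```

Calling the helper with the same rewards and two different values of `p_pos` returns identical weights, which is the independence claimed above.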

The binary regret of a classifier $c : X \to \{-1, 1\}$ on $Q_D$ is given by

$\mathrm{reg}_e(c, Q_D) = e(c, Q_D) - \min_{c'} e(c', Q_D)$,

where the min is over all classifiers $c' : X \to \{1, -1\}$. The importance weighted regret is defined similarly
with respect to the importance weighted loss.

For the k = 2 partial label case, the policy that a classifier c induces is simply the classifier. The regret of
policy c is defined as

$\mathrm{reg}_\eta(c, D) = \max_{c'} \eta(c', D) - \eta(c, D)$,

where

$\eta(c, D) = \mathbf{E}_{(x, r) \sim D}\left[ r_{c(x)} \right]$

is the value of the policy.

The theorem below states that the policy regret is bounded by the binary regret. We find it surprising
because strictly less information is available than in binary classification. Note that the lower bound in
section 5 implies that no reduction can do better. Redoing the proof with the offset set to 0 rather than 1/2
reveals that $2\,\mathrm{reg}_e(c, Q_D)$ bounds the policy regret, implying that the offset trick gives a factor of 2
improvement in the bound.
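
The factor of 2 can be traced through the expected importance. The following is a sketch of that calculation for a general offset $\theta$ (a symbol introduced here for illustration only), assuming actions $\{1, -1\}$, rewards in $[0, 1]$, and weights $|r_a - \theta| / p(a)$.

```latex
% Expected importance for a general offset \theta (illustrative symbol):
\mathbf{E}_{a \sim p}\left[ \frac{|r_a - \theta|}{p(a)} \right]
  = |r_1 - \theta| + |r_{-1} - \theta|
\quad\Longrightarrow\quad
\begin{cases}
  \theta = 1/2: & |r_1 - 1/2| + |r_{-1} - 1/2| \le 1, \\
  \theta = 0:   & r_1 + r_{-1} \le 2.
\end{cases}
% The difference in expected weight between the two labels is
% (r_1 - \theta) - (r_{-1} - \theta) = r_1 - r_{-1} for every \theta,
% so only the expected importance, and hence the constant, depends on the offset.
```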

Finally, note that the theorem is quantified over all classifiers, which includes the classifier returned by 
Learner in the last line of the algorithm. 

Theorem 3.1 (Binary Offset Regret) For all 2-class partial label problems D and all binary classifiers c,

$\mathrm{reg}_\eta(c, D) \le \mathrm{reg}_e(c, Q_D)$.

Furthermore, there exists D such that for all values $v \in [0, 1]$ there exists c such that $v = \mathrm{reg}_\eta(c, D) =
\mathrm{reg}_e(c, Q_D)$ (i.e., the bound is tight).

Proof We first bound the partial label regret of c in terms of importance weighted regret, and then apply 
known results to relate the importance weighted regret to binary regret. 

Conditioned on a particular value of x, we either make a mistake or we do not. If no mistake is made,
then the regrets of both sides are 0, and the claim holds trivially. Assume that a mistake is made. Without
loss of generality, $r_1 > r_{-1}$ and label -1 is chosen. The expected importance weight of label -1 is given by

$\mathbf{E}_{r \sim D|x} \mathbf{E}_{a \sim p}\left[ \frac{1}{p(a)} |r_a - 1/2| \cdot \mathbf{1}(a(r_a - 1/2) < 0) \right] = \mathbf{E}_{r \sim D|x}\left[ (r_{-1} - 1/2)_+ + (1/2 - r_1)_+ \right]$,

where we use the operator $(Z)_+ = Z \cdot \mathbf{1}(Z > 0)$. The difference in expected importance weights between
label 1 and label -1 is

$\mathbf{E}_{r \sim D|x}\left[ (r_1 - 1/2)_+ + (1/2 - r_{-1})_+ \right] - \mathbf{E}_{r \sim D|x}\left[ (r_{-1} - 1/2)_+ + (1/2 - r_1)_+ \right]$
$= \mathbf{E}_{r \sim D|x}\left[ (r_1 - 1/2) - (r_{-1} - 1/2) \right]$
$= \mathbf{E}_{r \sim D|x}[r_1 - r_{-1}] = \mathrm{reg}_\eta(c, D|x)$.

This shows that the importance weighted regret of the binary classifier is the policy regret. The folk theorem
from section 2 (see (26)) says that the importance weighted regret is bounded by the binary regret, times the
expected importance. The latter is

$\mathbf{E}_{(x,r) \sim D} \mathbf{E}_{a \sim p}\left[ \frac{1}{p(a)} |r_a - 1/2| \right] = \mathbf{E}_{(x,r) \sim D}\left[ |r_1 - 1/2| + |r_{-1} - 1/2| \right] \le 1$,

since both $r_1$ and $r_{-1}$ lie in [0, 1]. This proves the first part of the theorem.

For the second part, notice that the proof of the first part can be made an equality by having a reward vector
(0, 1) for each x always, and letting the classifier predict label 1 with probability (1 - v) over the draw of x. ■



4. The Offset Tree Reduction 

In this section we deal with the case of large k. 

4.1 The Offset Tree Algorithm 

The technique in the previous section can be applied repeatedly using a tree structure to give an algorithm
for general k. Consider a maximally balanced binary tree on the set of k choices, conditioned on a given
observation x. Every internal node in the tree is associated with a classification problem of predicting which
of its two inputs has the larger expected reward. At each node, the same offsetting technique is used as in the
binary case described in section 3.

For an internal node v, let $\Gamma(T_v)$ denote the set of leaves in the subtree $T_v$ rooted at v. Every input to a
node is either a leaf or a winning choice from another internal node closer to the leaves.

The training algorithm, Offset Tree, is given in Algorithm 2. The testing algorithm defining the predictor
is given in Algorithm 3.

4.2 The Offset Tree Regret Theorem 

The theorem below gives an extension of Theorem 3.1 for general k. For the analysis, we use a simple trick
which allows us to consider only a single induced binary problem, and thus a single binary classifier c. The
trick is to add the node index as an additional feature into each importance weighted binary example created
by Algorithm 2, and then train based upon the union of all the training sets.

As in section 3, the reduction transforms a partial label distribution D into a distribution $Q_D$ over binary
examples. To draw from $Q_D$, we draw (x, r) from D, an action a from the action-choosing distribution p,
and apply Algorithm 2 to transform (x, r, a, p(a)) into a set of binary examples (up to one for each level in
the tree) from which we draw uniformly at random. Note that $Q_D$ is independent of p, as explained in the
beginning of section 3.

Denote the policy induced by the Offset Test algorithm using classifier c by $\pi_c$. For the following theorem,
the definitions of regret are from section 3.

Algorithm 2: Offset Tree (binary classification algorithm Learner, partial label dataset S)
Fix a binary tree T over the choices
for each internal node v in order from leaves to root do
    Set $S_v = \emptyset$
    for each $(x, a, r_a, p(a)) \in S$ such that $a \in \Gamma(T_v)$ and all nodes on the path $v \leadsto a$ predict a on x do
        Let a' be the other choice at v and
        y = 1(a' comes from the left subtree of v)
        if $r_a < 1/2$, add $\left( x, \; y, \; \frac{p(a) + p(a')}{p(a)} (1/2 - r_a) \right)$ to $S_v$
        else add $\left( x, \; 1 - y, \; \frac{p(a) + p(a')}{p(a)} (r_a - 1/2) \right)$ to $S_v$
    Let $c_v$ = Learner(Costing($S_v$))
return $\{c_v\}$

Algorithm 3: Offset Test (classifiers $\{c_v\}$, unlabeled example x)
return unique action a for which every classifier $c_v$ from a to root prefers a.
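
Training and testing can be sketched together as below. This is a simplified illustration under several assumptions that are not in the paper: choices are integers, the logging distribution is stored with each example as a dict `p` (Algorithm 2 needs p(a') as well as p(a)), a separate classifier is trained per node, and the hypothetical base learner `weighted_majority_learner` consumes importance weights directly instead of going through Costing.

```python
from collections import defaultdict

def build_tree(choices):
    """Balanced binary tree: a leaf is an int, an internal node a (left, right) pair."""
    if len(choices) == 1:
        return choices[0]
    mid = (len(choices) + 1) // 2
    return (build_tree(choices[:mid]), build_tree(choices[mid:]))

def weighted_majority_learner(examples):
    """Hypothetical base learner: per-context weighted vote; label 1 means 'left wins'."""
    votes = defaultdict(float)
    for x, y, w in examples:
        votes[x] += w if y == 1 else -w
    return lambda x: 1 if votes.get(x, 0.0) >= 0 else 0

def winner(node, classifiers, x):
    """Offset Test on a subtree: follow the classifiers' choices down to a leaf."""
    while isinstance(node, tuple):
        node = node[0] if classifiers[node](x) == 1 else node[1]
    return node

def train_offset_tree(node, data, learner, classifiers=None):
    """Train one classifier per internal node, leaves to root; data rows are (x, a, r_a, p)."""
    if classifiers is None:
        classifiers = {}
    if not isinstance(node, tuple):
        return classifiers
    left, right = node
    train_offset_tree(left, data, learner, classifiers)
    train_offset_tree(right, data, learner, classifiers)
    examples = []
    for x, a, r_a, p in data:
        inputs = (winner(left, classifiers, x), winner(right, classifiers, x))
        if a not in inputs:
            continue  # a is outside this subtree or was eliminated below this node
        a_other = inputs[1] if a == inputs[0] else inputs[0]
        y = 1 if a_other == inputs[0] else 0      # 1 iff a' comes from the left subtree
        scale = (p[a] + p[a_other]) / p[a]
        if r_a < 0.5:
            examples.append((x, y, scale * (0.5 - r_a)))       # low reward: vote for a'
        else:
            examples.append((x, 1 - y, scale * (r_a - 0.5)))   # high reward: vote for a
    classifiers[node] = learner(examples)
    return classifiers
```

Walking from the root down in `winner` selects the same action as Algorithm 3, because each node's prediction depends only on x and the node, not on which choices reached it.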



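Theorem 4.1 (Offset Tree Regret) For all k-class partial label problems D, for all binary classifiers c,

$\mathrm{reg}_\eta(\pi_c, D) \le \mathrm{reg}_e(c, Q_D) \; \mathbf{E}_{(x,r) \sim D} \sum_{v(a,a') \in T} \left( |r_a - 1/2| + |r_{a'} - 1/2| \right)$
$\le (k - 1) \, \mathrm{reg}_e(c, Q_D)$,

where v(a, a') ranges over the (k - 1) internal nodes in T, and a and a' are its inputs determined by c's
predictions.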

Note: Section 5 shows that no reduction can give a better regret transform theorem. With a little bit of side
information, however, we can do better: The offset minimizing the regret bound turns out to be the median
value of the reward given x. Thus, it is generally best to pair choices which tend to have similar rewards.
Note that the algorithm need not know how well c performs on $Q_D$.

The proof below can be reworked with the offset set to 0, resulting in a regret bound which is a factor of 
2 worse. 

Proof We fix x, taking the expectation over the draw of x at the end. The first step is to show that the partial 
label regret is bounded by the sum of the importance weighted regrets over the binary prediction problems in 
the tree. We then apply the costing analysis (26) to bound this sum in terms of the binary regret. 

The proof of the first step is by induction on the nodes in the tree. We want to show that the sum of the
importance weighted regrets of the nodes in any subtree bounds the regret of the output choice for the subtree.
The hypothesis trivially holds for one-node trees.

Consider a node u making an importance weighted decision between choices a and a'. The expected
importance of choice a is given by

$\mathbf{E}_{r \sim D|x}\left[ \frac{p(a)}{p(a) + p(a')} \cdot \frac{p(a) + p(a')}{p(a)} (r_a - 1/2)_+ + \frac{p(a')}{p(a) + p(a')} \cdot \frac{p(a) + p(a')}{p(a')} (1/2 - r_{a'})_+ \right]$
$= \mathbf{E}_{r \sim D|x}\left[ (r_a - 1/2)_+ + (1/2 - r_{a'})_+ \right]$.

It is important to note that, by construction, only two actions can generate examples for a given internal node,
which is why the probabilities of a and a' above are normalized by p(a) + p(a').
Without loss of generality, assume that a' has the larger expected reward. The expected importance weighted
binary regret $\mathrm{wreg}_u$ of the classifier's decision is either 0 if it predicts a', or

$\mathrm{wreg}_u = \mathbf{E}_{r \sim D|x}\left[ (r_{a'} - 1/2)_+ + (1/2 - r_a)_+ \right] - \mathbf{E}_{r \sim D|x}\left[ (r_a - 1/2)_+ + (1/2 - r_{a'})_+ \right]$
$= \mathbf{E}_{r \sim D|x}\left[ 1/2 - r_a + r_{a'} - 1/2 \right] = \mathbf{E}_{r \sim D|x}[r_{a'} - r_a]$

if the classifier predicts a.

Let $T_v$ be the subtree rooted at node v, and let a be the choice output by $T_v$ on x. If the best choice in
$\Gamma(T_v)$ comes from the subtree L producing a, the policy regret of $T_v$ is given by

$\mathrm{Reg}(T_v) = \max_{y \in \Gamma(L)} \mathbf{E}_{r \sim D|x}[r_y] - \mathbf{E}_{r \sim D|x}[r_a] = \mathrm{Reg}(L) \le \sum_{u \in L} \mathrm{wreg}_u \le \sum_{u \in T_v} \mathrm{wreg}_u$.

If on the other hand the best choice comes from the other subtree R, with output a', we have

$\mathrm{Reg}(T_v) = \max_{y \in \Gamma(R)} \mathbf{E}_{r \sim D|x}[r_y] - \mathbf{E}_{r \sim D|x}[r_a]$
$= \mathrm{Reg}(R) + \mathbf{E}_{r \sim D|x}[r_{a'}] - \mathbf{E}_{r \sim D|x}[r_a]$
$\le \sum_{u \in R} \mathrm{wreg}_u + \mathrm{wreg}_v \le \sum_{u \in T_v} \mathrm{wreg}_u$,

proving the induction.

The induction hypothesis applied to T tells us that $\mathrm{Reg}(T) \le \sum_{v \in T} \mathrm{wreg}_v$. According to the Costing
theorem discussed in section 2, the importance weighted regret is bounded by the unweighted regret on the
resampled distribution, times the expected importance. The expected importance of deciding between actions
a and a' is

$\mathbf{E}_{r \sim D|x}\left[ \frac{p(a)}{p(a) + p(a')} \cdot \frac{p(a) + p(a')}{p(a)} |r_a - 1/2| + \frac{p(a')}{p(a) + p(a')} \cdot \frac{p(a) + p(a')}{p(a')} |r_{a'} - 1/2| \right]$
$= \mathbf{E}_{r \sim D|x}\left[ |r_a - 1/2| + |r_{a'} - 1/2| \right] \le 1$,

since all rewards are between 0 and 1. Since T has k - 1 internal nodes and $\mathrm{Reg}(T) = \mathrm{reg}_\eta(\pi_c, D|x)$, we thus have

$\mathrm{reg}_\eta(\pi_c, D|x) \le (k - 1) \, \mathrm{reg}_e(c, Q_D|x)$,

completing the proof for any x. Taking the expectation over x finishes the proof. ■

The setting above is akin to Boosting (0): At each round t, a booster creates an input distribution $D_t$ and
calls an oracle learning algorithm to obtain a classifier with some error $\epsilon_t$ on $D_t$. The distribution $D_t$ depends
on the classifiers returned by the oracle in previous rounds. The accuracy of the final classifier is analyzed
in terms of the $\epsilon_t$'s. The binary problems induced at internal nodes of an offset tree depend, similarly, on the
classifiers closer to the leaves. The performance of the resulting partial label policy is analyzed in terms of
the oracle's performance on these problems. (Notice that Theorem 4.1 makes no assumptions on the error
rates on the binary problems; in particular, it doesn't require them to be bounded away from 1/2.)

For the analysis, we use the simple trick from the beginning of this subsection to consider only a single 
binary classifier. The theorem is quantified over all classifiers, and thus it holds for the classifier returned by 
the algorithm. In practice, one can either call the oracle multiple times to learn a separate classifier for each 
node (as we do in our experiments), or use iterative techniques for dealing with the fact that the classifiers are 
dependent on other classifiers closer to the leaves. 

5. A Lower Bound 

This section shows that no method for reducing the partial label setting to binary classification can do better. 
First we formalize a learning reduction which relies upon a binary classification oracle. The lower bound we 
prove below holds for all such learning reductions. 

Definition 5.1 (Binary Classification Oracle) A binary classification oracle O is a (stateful) program that 
supports two kinds of queries: 

1. Advice. An advice query O(x, y) consists of a single example (x, y), where x is a feature vector and
$y \in \{1, -1\}$ is a binary label. An advice query is equivalent to presenting the oracle with a training
example, and has no return value.

2. Predict. A predict query O(x) is made with a feature vector x. The return value is a binary label.

All learning reductions work on a per-example basis, and that is the representation we work with here.

Definition 5.2 (Learning Reduction) A learning reduction is a pair of algorithms R and $R^{-1}$.

1. The algorithm R takes a partially labeled example $(x, a, r_a, p(a))$ and a binary classification oracle O
as input, and forms a (possibly dependent) sequence of advice queries.

2. The algorithm $R^{-1}$ takes an unlabeled example x and a binary classification oracle O as input. It
asks a (possibly dependent) sequence of predict queries, and makes a prediction dependent only on the
oracle's predictions. The oracle's predictions may be adversarial (and are assumed so by the analysis).

We are now ready to state the lower bound. 

Theorem 5.1 For all reductions $(R, R^{-1})$, there exists a partial label problem D and an oracle O such that

$\mathrm{reg}_\eta(R^{-1}(O), D) \ge (k - 1) \, \mathrm{reg}_e(O, R(D))$,

where R(D) is the binary distribution induced by R on D, and $R^{-1}(O)$ is the policy resulting from $R^{-1}$
using O.

Proof The proof is by construction. We choose D to be uniform over k examples, with example i having 1
in the i-th component of its reward vector, and zeros elsewhere. The corresponding feature vector consists of
the binary representation of the index with reward 1. Let the action-choosing distribution be uniform.

The reduction R produces some simulatable sequence of advice calls when the observed reward is 0. The
oracle ignores all advice calls from R and chooses to answer all queries with zero error rate according to this
sequence.

There are two cases: Either R observes reward 0 (with probability (k - 1)/k) or it observes reward 1
(with probability 1/k). In the first case, the oracle has error rate 0 (and hence regret 0). In the second case,
it has error rate (and regret) of at most 1. Thus the expected error rate of the oracle on R(D) is at most 1/k.

The inverse reduction $R^{-1}$ has access to only the unlabeled example x and the oracle O. Since the
oracle's answers are independent of the draw from D, the output action has reward 0 with probability (k - 1)/k
and reward 1 with probability 1/k, implying a regret of (k - 1)/k with respect to the best policy. This is a
factor of k - 1 greater than the regret of the oracle, proving the lower bound. ■
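
The arithmetic in this construction can be tabulated exactly. The helper below is a hypothetical illustration that only restates the two quantities from the proof (policy regret (k - 1)/k and oracle regret at most 1/k) and their ratio; it does not simulate a reduction.

```python
from fractions import Fraction

def lower_bound_gap(k):
    """Regrets in the construction above (D uniform over k examples)."""
    oracle_regret_bound = Fraction(1, k)   # the oracle errs only when reward 1 is observed
    policy_regret = Fraction(k - 1, k)     # the output is independent of the draw from D
    return policy_regret, oracle_regret_bound, policy_regret / oracle_regret_bound
```

For every k the ratio comes out to k - 1, matching the factor in Theorem 5.1.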

6. Analysis of Simple Reductions

This section analyzes two simple approaches for reducing partial label problems to basic supervised learning 
problems. These approaches have been discussed previously, but the analysis is new. 

6.1 The Regression Approach 

The most obvious approach is to regress on the value of a choice as in Algorithm 4, and then use the argmax
classifier as in Algorithm 5. Instead of learning a single regressor, we can learn a separate regressor for each
choice.



Algorithm 4: Partial-Regression (regression algorithm Regress, partial label dataset S)
Let $S' = \emptyset$
for each $(x, a, r_a) \in S$ do
    Add $((x, a), r_a)$ to S'.
return f = Regress(S').



Algorithm 5: Argmax (regressor f, unlabeled example x)
return $\arg\max_a f(x, a)$
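
Algorithms 4 and 5 can be sketched with a deliberately simple regressor. The tabular mean used for `Regress` below is an illustrative stand-in chosen for brevity (any regression algorithm could take its place), and defaulting unseen (x, a) pairs to 0 is an assumption.

```python
from collections import defaultdict

def partial_regression(dataset):
    """Algorithm 4 with a tabular regressor: f(x, a) is the mean observed reward."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, a, r_a in dataset:
        sums[(x, a)] += r_a
        counts[(x, a)] += 1

    def f(x, a):
        n = counts.get((x, a), 0)
        return sums[(x, a)] / n if n else 0.0  # assumption: unseen pairs predict 0

    return f

def argmax_policy(f, actions):
    """Algorithm 5: choose the action with the largest predicted reward."""
    return lambda x: max(actions, key=lambda a: f(x, a))
```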



The squared error of a regressor $f : X \to \mathbb{R}$ on a distribution P over $X \times \mathbb{R}$ is denoted by

$\ell_r(f, P) = \mathbf{E}_{(x,y) \sim P} (f(x) - y)^2$.

The corresponding regret is given by $\mathrm{reg}_r(f, P) = \ell_r(f, P) - \min_{f'} \ell_r(f', P)$.

The following theorem relates the regret of the resulting predictor to that of the learned regressor.

Theorem 6.1 For all k-class partial label problems D and all squared-error regressors f,

$\mathrm{reg}_\eta(\pi_f, D) \le \sqrt{2k \, \mathrm{reg}_r(f, P_D)}$,

where $P_D$ is the regression distribution induced by Algorithm 4 on D, and $\pi_f$ is the argmax policy based on
f. Furthermore, there exist D and f such that the bound is tight.

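The square root in the theorem is undesirable: since regrets are at most 1, the square root of the regressor's
regret is at least as large as the regret itself, and the theorem is vacuous when the right hand side is greater than 1.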

Proof Fix x, and let $\pi_f$ choose some action a with true value $v_a = \mathbf{E}_{r \sim D|x}[r_a]$. Some other action a* may have
a larger expected reward $v_{a^*} > v_a$. The squared error regret suffered by f on a is
$\mathbf{E}_{r \sim D|x}\left[ (r_a - f(x, a))^2 - (r_a - v_a)^2 \right] = (v_a - f(x, a))^2$. Similarly for a*, we have regret $(v_{a^*} - f(x, a^*))^2$. In order for a to be
chosen over a*, we must have $f(x, a) \ge f(x, a^*)$. Convexity of the two regrets implies that the minimum is
reached when $f(x, a) = f(x, a^*) = \frac{v_a + v_{a^*}}{2}$, where the regret for each of the two choices is $\left( \frac{v_{a^*} - v_a}{2} \right)^2$. The
regressor need not suffer any regret on the other k - 2 arms. Thus with average regret $\frac{(v_{a^*} - v_a)^2}{2k}$, a regret of
$v_{a^*} - v_a$ can be induced, completing the proof of the first part. For the second part, note that an adversary
can play the optimal strategy outlined above achieving the bound precisely. ■
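
The tight case can be checked numerically. The sketch below assumes, as in the argument above, that the regret is averaged uniformly over the k arms; `regression_worst_case` is a hypothetical helper, not part of the paper.

```python
import math

def regression_worst_case(v_best, v_chosen, k):
    """Adversarial regressor from the proof: both arms are predicted at their midpoint."""
    mid = (v_best + v_chosen) / 2
    per_arm = (v_best - mid) ** 2        # equals (v_chosen - mid) ** 2
    avg_regret = 2 * per_arm / k         # the other k - 2 arms incur no regret
    policy_regret = v_best - v_chosen
    return policy_regret, math.sqrt(2 * k * avg_regret)
```

The two returned values coincide up to floating point error, so the bound of Theorem 6.1 holds with equality for this regressor.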

6.2 Importance Weighted Classification

Zadrozny (25) noted that the partial label problem could be reduced to importance weighted multiclass
classification. After Algorithm 6 creates importance weighted multiclass examples, the weights are stripped using
Costing (the rejection sampling on the weights discussed in Section 2), and then the resulting multiclass
distribution is converted into a binary distribution using, for example, the all-pairs reduction (10). The last step
is done to get a comparable analysis.



Algorithm 6: IWC-Train (binary classification algorithm Learn, partial label dataset S)
Let $S' = \emptyset$
for each $(x, a, p(a), r_a) \in S$ do
    Add $(x, a, r_a / p(a))$ to S'.
return All-Pairs-Train(Learn, Costing(S'))



All-Pairs-Train uses a given binary learning algorithm Learn to distinguish each pair of classes in the multi- 
class distribution created by Costing. The learned classifier c predicts, given x and a distinct pair of classes 
(i, j), whether class i is more likely than j given x. At test time, we make a choice using All-Pairs-Test, 
which takes c and an unlabeled example x, and returns the class that wins the most pairwise comparisons on
x, according to c.



Algorithm 7: IWC-Test (binary classifier c, unlabeled example x) 
return All-Pairs-Test(c, x). 
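
The two ends of this pipeline can be sketched as follows. Both functions are illustrative: the Costing and All-Pairs-Train steps in between are omitted, the pairwise classifier signature `c(x, i, j)` is an assumption, and ties in `all_pairs_test` are broken in favor of the class listed first.

```python
from itertools import combinations

def iwc_examples(dataset):
    """First step of Algorithm 6: (x, a, p(a), r_a) -> weighted example (x, a, r_a / p(a))."""
    return [(x, a, r_a / p_a) for x, a, p_a, r_a in dataset]

def all_pairs_test(c, x, classes):
    """Return the class winning the most pairwise comparisons; c(x, i, j) is True if i beats j."""
    wins = {i: 0 for i in classes}
    for i, j in combinations(classes, 2):
        wins[i if c(x, i, j) else j] += 1
    return max(classes, key=lambda i: wins[i])
```

Note that test time requires k(k - 1)/2 classifier evaluations here, compared with at most one per tree level for the Offset Tree.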



A basic theorem applies to this approach. 

Theorem 6.2 For all k-class partial label problems D and all binary classifiers c,

$\mathrm{reg}_\eta(\pi_c, D) \le \mathrm{reg}_e(c, Q_D) \, (k - 1) \, \mathbf{E}_{(x,r) \sim D} \sum_a r_a \le \mathrm{reg}_e(c, Q_D) \, (k - 1) \, k$,

where $\pi_c$ is the IWC-Test policy based on c and $Q_D$ is the binary distribution induced by IWC-Train on D.

Proof The proof first bounds the policy regret in terms of the importance weighted multiclass regret. Then,
we apply known results for the other reductions to relate the policy regret to binary classification regret.

Fix a particular x. The policy regret of choosing action a over the best action a* is
$\mathbf{E}_{r \sim D|x}[r_{a^*}] - \mathbf{E}_{r \sim D|x}[r_a]$. The importance weighted multiclass loss of action a is

$\mathbf{E}_{r \sim D|x} \sum_{a' \ne a} p(a') \frac{r_{a'}}{p(a')} = \mathbf{E}_{r \sim D|x} \sum_{a' \ne a} r_{a'}$,

since the loss is proportional to $r_{a'} / p(a')$ with probability p(a'). This implies the importance weighted regret
of

$\mathbf{E}_{r \sim D|x}\left[ \sum_{a' \ne a} r_{a'} - \sum_{a' \ne a^*} r_{a'} \right] = \mathbf{E}_{r \sim D|x}[r_{a^*} - r_a]$,

which is the same as the policy regret.

The importance weighted regret is bounded by the unweighted regret, times the expected importance (see 
(I26I)). which in turn is bounded by k. Multiclass regret on k classes is bounded by binary regret times fc — 1 
using the all-pairs reduction which completes the proof. ■ 
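The identity at the heart of the proof, that the importance weighted regret equals the policy regret once p(a') cancels, can be checked numerically. The sketch below uses a hypothetical vector of expected rewards for a single fixed x; the function names are illustrative only.

```python
def importance_weighted_loss(rewards, a):
    """Expected importance weighted multiclass loss of predicting action a.

    Label a' appears with probability p(a') and weight r_a' / p(a'), so p(a')
    cancels and the expected cost of predicting a is the sum of r_a' over a' != a.
    """
    return sum(r for other, r in enumerate(rewards) if other != a)

def policy_regret(rewards, a):
    """Reward lost by choosing a instead of the best action."""
    return max(rewards) - rewards[a]

rewards = [0.2, 0.9, 0.5, 0.0]  # hypothetical expected rewards for one x
best = max(range(len(rewards)), key=rewards.__getitem__)
gaps = [
    importance_weighted_loss(rewards, a) - importance_weighted_loss(rewards, best)
    for a in range(len(rewards))
]
```

For every action a, `gaps[a]` agrees with `policy_regret(rewards, a)` up to floating-point error.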
Relative to the Offset Tree, this theorem has an undesirable extra factor of k in the regret bound. While this
factor is due to the all-pairs reduction being a weak regret transform, we are aware of no alternative approach 
for reducing multiclass to binary classification that in composition can yield the same regret transform as the 
Offset Tree. 

7. Experimental Results 





Properties                          Single regressor   k regressors
Dataset       k      m   Weighting  M5P      REPTree   M5P      REPTree   Offset Tree
ecoli         8    336   0.3120     0.5663   0.3376    0.3752   0.3811    0.2311
flare         7   1388   0.1565     0.1570   0.1685    0.1570   0.1592    0.1506
glass         6    214   0.5938     0.6662   0.5846    0.5800   0.6077    0.5000
letter       25  20000   0.3546     0.6974   0.5491    0.4456   0.5352    0.3790
lymph         4    148   0.2953     0.5267   0.4622    0.3422   0.3400    0.3114
optdigits    10   5620   0.1682     0.5426   0.4108    0.1948   0.2956    0.1649
page-blocks   5   5473   0.0407     0.0590   0.0451    0.0571   0.0465    0.0488
pendigits    10  10992   0.1029     0.2492   0.1840    0.1408   0.1774    0.0976
satimage      6   6435   0.1703     0.2027   0.1968    0.1787   0.1878    0.1853
soybean      19    683   0.6533     0.8824   0.7327    0.7688   0.7473    0.5971
vehicle       4    846   0.3719     0.6142   0.5665    0.3886   0.4114    0.3743
vowel        11    990   0.6403     0.9034   0.8919    0.7440   0.8198    0.6501
yeast        10   1484   0.5406     0.6626   0.5679    0.5406   0.5697    0.4904

Table 1: Dataset-specific test error rates (see Section 7.1). Here k is the number of choices and m is the
number of examples.

We conduct two sets of experiments. The first set compares the Offset Tree with the two approaches from
Section 6. The second compares with the Banditron on the dataset used in that paper.

7.1 Comparisons with Reductions 

Ideally, this comparison would be with a data source in the partial label setting. Unfortunately, data of this sort
is rarely available publicly, so we used a number of publicly available multiclass datasets (21) and allowed
queries for the reward (1 or 0 for correct or wrong) of only one value per example.
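One way to simulate this protocol from a fully labeled multiclass dataset is sketched below. The uniform choice of the queried action and the function name `to_partial_label` are assumptions for illustration; the text does not state how the queried value was selected in the experiments.

```python
import random

def to_partial_label(dataset, k, rng):
    """Convert fully labeled (x, y) pairs into partial label tuples.

    For each example one action a is drawn uniformly from range(k) (an
    assumption), and only the reward of that action is revealed: 1 if a equals
    the true label y, otherwise 0. Each tuple is (x, a, p(a), r_a), p(a) = 1/k.
    """
    out = []
    for x, y in dataset:
        a = rng.randrange(k)
        out.append((x, a, 1.0 / k, 1 if a == y else 0))
    return out

# Hypothetical three-class dataset of (features, label) pairs.
multiclass = [([0.1, 0.2], 0), ([0.9, 0.4], 2), ([0.5, 0.5], 1)]
partial = to_partial_label(multiclass, k=3, rng=random.Random(0))
```

The resulting tuples have the same shape as the partial label dataset S consumed by the training algorithms, with the true label discarded.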

For all datasets, we report the average result over 10 random splits (fixed for all methods), with 2/3 of
the dataset used for training and 1/3 for testing. Figure 1 shows the error rates (in %) of the Offset Tree
plotted against the error rates of the regression (left) and the importance weighting (right). Decision trees
(J48 in Weka (23)) were used as a base binary learning algorithm for both the Offset Tree and the importance
weighting. For the regression approach, we learned a separate regressor for each of the k choices. (A single
regressor trained by adding the choice as an additional feature performed worse.) M5P and REPTree, both
available in Weka (23), were used as base regression algorithms.

The Offset Tree clearly outperforms regression, in some cases considerably. The advantage over importance
weighting is moderate: often the performance is similar, and occasionally it is substantially better.

We did not perform any parameter tuning because we expect that practitioners encountering partial label
problems may not have the expertise or time for such optimization. All datasets tested are included. Note
that although some error rates appear large, we are choosing among k alternatives and thus an error rate of
less than 1 − 1/k gives an advantage over random guessing. Dataset-specific test error rates are reported in
Table 1.
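To make the random-guessing baseline concrete, the sketch below computes 1 − 1/k for each dataset and compares it with the Offset Tree error rate. The k values and error rates are copied from Table 1; the variable and function names are illustrative.

```python
# (dataset, k, Offset Tree test error) taken from Table 1.
results = [
    ("ecoli", 8, 0.2311), ("flare", 7, 0.1506), ("glass", 6, 0.5000),
    ("letter", 25, 0.3790), ("lymph", 4, 0.3114), ("optdigits", 10, 0.1649),
    ("page-blocks", 5, 0.0488), ("pendigits", 10, 0.0976),
    ("satimage", 6, 0.1853), ("soybean", 19, 0.5971), ("vehicle", 4, 0.3743),
    ("vowel", 11, 0.6501), ("yeast", 10, 0.4904),
]

def random_guess_error(k):
    """Expected error rate of guessing uniformly among k choices."""
    return 1.0 - 1.0 / k

# Positive margin means the Offset Tree beats uniform random guessing.
margins = {name: random_guess_error(k) - err for name, k, err in results}
for name, k, err in results:
    print(f"{name:12s} k={k:2d} baseline={random_guess_error(k):.4f} offset_tree={err:.4f}")
```

On every dataset in the table the margin is positive, including those such as vowel and soybean where the raw error rate looks high.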
Figure 1: Error rates (in %) of Offset Tree versus the regression approach using two different base regression
algorithms (left) and Offset Tree versus Importance Sampling (right) on several different datasets 
using decision trees as a base classifier learner. 



7.2 Comparison with the Banditron Algorithm 

The Banditron (7) is an algorithm for the special case of the problem where one of the rewards is 1 and the 
rest are 0. The sample complexity guarantees provided for it are particularly good when the correct choice is 
separated by a multiclass margin from the other classes. 

We chose the Binary Perceptron as a base classification algorithm since it is the closest fully supervised
learning algorithm to the Banditron. Exploration was done according to Epoch-Greedy (17) instead
of Epsilon-Greedy (as in the Banditron), motivated by the observation that the optimal rate of exploration
should decay over time. The Banditron was tested on one dataset, a 4-class specialization of the Reuters
RCV1 dataset consisting of 673,768 examples. We use precisely the same dataset, made available by the
authors of (7).

Since the Banditron analysis suggests the realizable case, and the dataset tested on is nearly perfectly
separable, we also specialized the Offset Tree for the realizable case. In particular, in the realizable case we
can freely learn from every observation, implying it is unnecessary to importance weight by 1/p(a). We also
specialize Epoch-Greedy to this case by using a realizable bound, resulting in a probability of exploration
that decays as $1/t^{1/2}$ rather than $1/t^{1/3}$.

The algorithms are compared according to their error rate. For the Banditron, the error rate after one
pass on the dataset was 16.3%. For the realizable Offset Tree method above, the error rate was 10.72%. For
the fully agnostic version of the Offset Tree, the error rate was 18.6%. These results suggest there is some
tradeoff between being optimal when there is arbitrary noise and performing well when there is no or very little
noise. In the no-noise situation, the realizable Offset Tree performs substantially better than the Banditron.



8. Discussion 

We have analyzed the tractability of learning when only one outcome from a set of k alternatives is known,
in the reductions setting. The Offset Tree approach has a worst-case dependence on k − 1 (Theorem 4.1),
and no other reduction approach can provide a better guarantee (Section 5). Furthermore, with an O(log k)
computation, the Offset Tree is qualitatively more efficient than all other known algorithms, the best of which
are O(k). Experimental results suggest that this approach is empirically promising.
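The gap between logarithmic and linear (or worse) computation can be illustrated with a small calculation. The sketch assumes a balanced binary tree over the k choices and compares its depth with the number of pairs an all-pairs test must evaluate; the function names are illustrative.

```python
def tree_depth(k):
    """Binary decisions on a root-to-leaf path of a balanced tree with k leaves."""
    return (k - 1).bit_length()  # equals ceil(log2(k)) for k >= 1

def all_pairs_count(k):
    """Number of class pairs compared by an all-pairs test."""
    return k * (k - 1) // 2

for k in (4, 8, 25):
    print(k, tree_depth(k), all_pairs_count(k))
```

For the largest problem in Table 1 (k = 25), a balanced tree needs 5 binary decisions per prediction, while an all-pairs test evaluates 300 pairs.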

The algorithms presented here show how to learn from one step of exploration. By aggregating information
over multiple steps, we can learn good policies using binary classification methods. A straightforward
extension of this method to deeper time horizons T is not compelling, as k − 1 is replaced by $k^T$ in the regret
bounds. Due to the lower bound proved here, it appears that further progress on the multi-step problem in
this framework must come with additional assumptions.
Appendix A. Sample Complexity Bound

This section proves a simple sample complexity bound on the performance of Binary Offset. For ease of 
comparison with existing results, we specialize the problem set to partial label binary classification problems 
where one label has reward 1 and the other label has reward 0. Note that this is not equivalent to assuming 
realizability: Conditioned on x, any distribution over reward vectors (0, 1) and (1, 0) is allowed. 
Comparing the bound with standard results in binary classification (see, for example, Q) shows that
the bounds are identical, while eliminating the offset trick weakens the performance by a factor of roughly 2.

When a sample set is used as a distribution, we mean the uniform distribution over the sample set (i.e., an 
empirical average). 

Theorem A.1 (Binary Offset Sample Complexity) Let the action choosing distribution be uniform. For all
partial label binary classification problems D and all sets of binary classifiers C, after observing a set S of
m examples drawn independently from D, with probability at least 1 − δ,

$$|\eta(c, D) - \eta(c, S)| \le \sqrt{\frac{\ln|C| + \ln(2/\delta)}{2m}}$$

holds simultaneously for all classifiers c ∈ C. Furthermore, if the offset is set to 0, then

$$|\eta(c, D) - \eta(c, S)| \le \sqrt{\frac{\ln|C| + \ln(3/\delta)}{m - 2\sqrt{m\ln(3/\delta)}}}.$$

Proof First note that for partial label binary classification problems, the Binary Offset reduction recovers
the correct label. Since all importance weights are 1, no examples are lost in converting from importance
weighted classification to binary classification. Consequently, the Occam's Razor bound on the deviations
of error rates implies that, with probability 1 − δ, for all classifiers c ∈ C,

$$|e(c, Q_S) - e(c, Q_D)| \le \sqrt{\frac{\ln|C| + \ln(2/\delta)}{2m}},$$

where the induced distribution $Q_D$ is D with the two reward vectors encoded as
binary labels. Observing that $e(c, Q_D) = \eta(c, D)$ finishes the first half of the proof.

For the second half, notice that rejection sampling reduces the number of examples by a factor of two in
expectation, and with probability at least 1 − δ/3, this number is at least $m/2 - \sqrt{m\ln(3/\delta)}$. Applying the
Occam's Razor bound with probability of failure 2δ/3 gives

$$|e(c, Q_S) - e(c, Q_D)| \le \sqrt{\frac{\ln|C| + \ln(3/\delta)}{m - 2\sqrt{m\ln(3/\delta)}}}.$$

Taking the union bound over the two failure modes proves that the above inequality holds with probability
1 − δ. Observing the equivalence $e(c, Q_D) = \eta(c, D)$ gives us the final result. ■



The sample complexity bound provides a stronger (absolute) guarantee, but it requires samples to be
independent and identically distributed. The regret bound, on the other hand, provides a relative assumption-free
guarantee, and thus applies always.
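The two deviation bounds appearing in the proof can be compared numerically. The sketch below evaluates both right-hand sides as we read them from the proof; the function names and the particular values of |C|, m, and δ are hypothetical.

```python
import math

def offset_bound(num_classifiers, m, delta):
    """Deviation bound with the offset trick: sqrt((ln|C| + ln(2/delta)) / (2m))."""
    return math.sqrt((math.log(num_classifiers) + math.log(2 / delta)) / (2 * m))

def no_offset_bound(num_classifiers, m, delta):
    """Deviation bound with offset 0, where rejection sampling keeps about m/2 examples."""
    kept_twice = m - 2 * math.sqrt(m * math.log(3 / delta))
    if kept_twice <= 0:
        raise ValueError("m is too small for the bound to apply")
    return math.sqrt((math.log(num_classifiers) + math.log(3 / delta)) / kept_twice)

with_offset = offset_bound(num_classifiers=1000, m=10000, delta=0.05)
without_offset = no_offset_bound(num_classifiers=1000, m=10000, delta=0.05)
ratio = without_offset / with_offset
```

For large m the ratio of the two bounds approaches the square root of 2, which corresponds to needing roughly twice as many examples without the offset trick.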